feat: SM120 (Blackwell Desktop) support for DeepSeek-V4 inference#24692

Open
AliceChenyy wants to merge 16 commits into sgl-project:main from AliceChenyy:sm120-dsv4-rebase

Conversation

AliceChenyy commented May 8, 2026

Summary

Adds full SM120 (RTX PRO 6000 / RTX 5090 / DGX Spark, compute 12.0) support for DeepSeek-V4/V3 on SGLang. SM120 desktop Blackwell GPUs lack TMEM, tcgen05, and DeepGEMM support — this PR provides Triton-based fallback kernels for all critical paths and enables CUDA graph capture.

Key changes

New kernels (7 files):

  • mxfp4_moe_sm120_triton.py — Triton fused MXFP4 dequant + GEMM for MoE experts (4.1x vs PyTorch per-GEMM)
  • flash_mla_sm120_triton.py — Triton FlashMLA sparse decode kernel (3.2–5.4x vs FlashInfer fallback)
  • sm120_mqa_triton.py — FP8 paged MQA with wq-precompute + vectorized batch (CUDA graph compatible)
  • flash_mla_sm120_fallback.py / sm120_mqa_fallback.py / mxfp4_moe_fallback.py — PyTorch fallback paths
  • test_sm120_mqa_fallback.py — Unit tests for MQA fallback correctness

SM120 guards (10 modified files):

  • DeepGEMM / tilelang MHC disabled on SM120 (no TMEM/tcgen05)
  • NSA backend: tilelang default, skip DeepGEMM metadata allocation
  • FlashMLA: SM120 adapter in deepseek_v4_backend.py
  • MoE: auto-select marlin backend on SM120
  • 3 CUDA-graph-breaking paths fixed (.unique()/.item()/.nonzero() → vectorized)

Bug fix (found during latest-image validation):

  • FlashMLA Triton kernel: handle KV cache uint8 dtype (upstream changed from float8_e4m3fn)
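
To illustrate why the dtype matters here, the following is a minimal pure-Python sketch of how one float8_e4m3fn byte decodes, compared with reading the same byte as an integer. The helper name `decode_e4m3fn` is hypothetical and not part of this PR; the actual fix reinterprets the cache tensor's dtype before it reaches the Triton kernel.

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn bit pattern: 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return float("nan")  # e4m3fn has no infinities, only NaN
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)  # exponent bias is 7


raw_byte = 0x38
as_integer = raw_byte              # what a uint8 load sees: 56
as_fp8 = decode_e4m3fn(raw_byte)   # what the attention math expects: 1.0
```

The same stored byte yields 56 when read as uint8 and 1.0 when read as FP8, which is consistent with the corrupted attention scores described in the commit message.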

Results (8× RTX PRO 6000, TP=8, CUDA graph)

On sglang:dev-cu13 (sgl-kernel 0.4.2.post1, PyTorch 2.11+cu130):

| Metric | Value |
| --- | --- |
| GSM8K 5-shot (200q) | 99.0% accuracy |
| Decode (BS=1) | 11.40 tok/s (TPOT = 87.7ms) |
| CUDA graph capture | ✅ all batch sizes captured |

On older nightly-dev-20260430 (sgl-kernel 0.4.1, PyTorch 2.9.1+cu129):

| Metric | Value |
| --- | --- |
| GSM8K 5-shot (200q) | 98.0% accuracy |
| Decode (BS=1) | 10.26 tok/s (TPOT = 97.5ms) |
| CUDA graph speedup | 2.4× vs without (4.36 → 10.26 tok/s) |
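
As a quick consistency check on the reported numbers, the sketch below recomputes decode throughput from TPOT (tok/s = 1000 / TPOT in ms at BS=1) and the two ratios quoted in this description. The function name is illustrative only; the inputs are the figures reported above.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Single-stream decode throughput implied by time per output token."""
    return 1000.0 / tpot_ms


new_tps = tokens_per_second(87.7)   # dev-cu13 image
old_tps = tokens_per_second(97.5)   # nightly-dev-20260430 image

graph_speedup = 10.26 / 4.36        # with vs. without CUDA graph (older image)
image_gain = new_tps / old_tps - 1  # relative throughput gain of the newer image

print(f"{new_tps:.2f} tok/s, {old_tps:.2f} tok/s")
print(f"CUDA graph speedup: {graph_speedup:.1f}x, newer image: +{image_gain:.0%}")
```

The recomputed values match the reported 11.40 and 10.26 tok/s, the 2.4× CUDA graph speedup, and the 11% improvement between images.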

Motivation

  • SM120 is desktop Blackwell (RTX 5090, RTX PRO 6000) — no server-class features (TMEM, tcgen05, NVSwitch)
  • Prior to this PR, SGLang cannot run DSv4 on SM120 at all (DeepGEMM JIT crash, no MXFP4 MoE support)
  • Enables developer/researcher access to DSv4 on workstation GPUs

Notes

  • This is a rebase of feat: SM120 support for DeepSeek-V4 inference #24047 (which targeted deepseek_v4 branch) onto main
  • All SM120 kernel code is identical between the two PRs (verified by diff)
  • SM120 kernels are guarded by is_sm120_supported() — zero impact on SM100/SM103 paths
  • Environment variables: SGLANG_SM120_TRITON_FLASHMLA=1 (default on), SGLANG_SM120_MQA_FALLBACK=0 (default off)
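
A minimal sketch of how these two flags could be read, assuming the same `== "1"` convention that appears later in this PR for SGLANG_SM120_TRITON_FLASHMLA. The helper `env_flag` is hypothetical and is not SGLang's API.

```python
import os


def env_flag(name: str, default: bool) -> bool:
    """Return True only when the variable is exactly "1" (hypothetical helper)."""
    return os.environ.get(name, "1" if default else "0") == "1"


# Defaults follow the PR description: Triton FlashMLA on, MQA fallback off.
use_triton_flashmla = env_flag("SGLANG_SM120_TRITON_FLASHMLA", default=True)
use_mqa_fallback = env_flag("SGLANG_SM120_MQA_FALLBACK", default=False)
```

With this convention, exporting SGLANG_SM120_TRITON_FLASHMLA=0 would select the PyTorch path, and any value other than "1" is treated as off.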

Test plan

  • GSM8K 5-shot 200q: 99.0% on latest sglang:dev-cu13 image
  • GSM8K 5-shot 200q: 98.0% on older nightly-dev-20260430 image
  • Single request sanity check (15+27=42, capital of France, Chinese translation)
  • TPOT benchmark: 87.7ms BS=1 on dev-cu13 (11% faster than old image)
  • CUDA graph capture: all batch sizes captured successfully
  • Unit tests: test_sm120_mqa_fallback.py
  • PDL enabled on SM120: verified correct (GSM8K 10/10)
  • CI (no SM120 runner in CI — tested on local 8× RTX PRO 6000)

CI States

Latest PR Test (Base): ❌ Run #26232048205
Latest PR Test (Extra): ❌ Run #26232047397

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces comprehensive support and fallbacks for NVIDIA SM120 (Blackwell Desktop) architectures, which lack certain hardware features like TMEM/tcgen05 required by DeepGEMM and FlashMLA. It implements both pure-PyTorch and optimized Triton fallback kernels for FlashMLA, MQA logits, and MXFP4 MoE operations. Additionally, it updates environment configurations, metadata handling, and server arguments to automatically detect SM120 and route to these compatible implementations. Review feedback highlighted misleading function aliasing in the MoE implementation, the replacement of descriptive assertion messages with "TODO" strings, and inconsistent backend selection logic in the new entrypoints.

Comment on lines +135 to +137
from sglang.srt.layers.moe.fused_moe_triton.mxfp4_moe_sm120_triton import (
    mxfp4_moe_forward_triton as mxfp4_moe_forward_fallback,
)
Contributor

high

The alias mxfp4_moe_forward_triton as mxfp4_moe_forward_fallback is highly misleading. It imports the optimized Triton implementation but names it as if it were the fallback. This is especially confusing given that a separate PyTorch fallback implementation exists in mxfp4_moe_fallback.py.

This can cause significant confusion during future maintenance. Please rename the alias to accurately reflect the imported function, for example, by simply importing mxfp4_moe_forward_triton and using that name directly.

Suggested change
- from sglang.srt.layers.moe.fused_moe_triton.mxfp4_moe_sm120_triton import (
-     mxfp4_moe_forward_triton as mxfp4_moe_forward_fallback,
- )
+ from sglang.srt.layers.moe.fused_moe_triton.mxfp4_moe_sm120_triton import (
+     mxfp4_moe_forward_triton,
+ )

Contributor

Please address this comment

Author

Done — now imports mxfp4_moe_forward_triton directly without aliasing.

Comment on lines +67 to +68
assert head_dim == 128, "TODO"
assert block_size == 64, "TODO"
Contributor

medium

The assert messages have been changed from descriptive explanations to simply "TODO". The previous implementation also had these asserts, but the original messages explained why these values were hardcoded (e.g., "torch reference impl hardcodes DSV4 indexer head_dim=128"). Please consider restoring the more descriptive messages to improve code maintainability.

Suggested change
- assert head_dim == 128, "TODO"
- assert block_size == 64, "TODO"
+ assert head_dim == 128, "Vectorized torch impl hardcodes DSV4 indexer head_dim=128"
+ assert block_size == 64, "Vectorized torch impl hardcodes block_size=64 cache layout"

Contributor

Ditto.

Author

Done — restored descriptive assert messages.

_use_triton_flashmla = os.environ.get("SGLANG_SM120_TRITON_FLASHMLA", "1") == "1"


def flash_mla_with_kvcache_entrypoint(backend: str, **kwargs):
Contributor

medium

In flash_mla_with_kvcache_entrypoint, the backend selection logic for SM120 is confusing. The function takes a backend parameter (from SGLANG_HACK_FLASHMLA_BACKEND), but this parameter is ignored when _is_sm120 is true. Instead, the choice between the Triton kernel and the PyTorch fallback is controlled by a separate environment variable, SGLANG_SM120_TRITON_FLASHMLA.

To improve clarity, it would be better to unify this control flow. Consider using the backend parameter to also control the implementation path on SM120, for instance, by checking for values like "triton" or "torch".

Contributor

Ditto.

Author

Done — the SGLANG_HACK_FLASHMLA_BACKEND env var has been removed. On SM120 the backend is now hardcoded to "kernel" (Triton path), with SGLANG_SM120_TRITON_FLASHMLA env var as the only escape hatch to fall back to PyTorch attention if needed for debugging. This simplifies the control flow as the gemini-bot suggested.

Collaborator

Rename it to flash_mla_with_kvcache_sm120. Also, please move this function to a standalone file. It shouldn't be put in the fallback file.

Author

Created flash_mla_sm120.py as the standalone file with the function renamed to flash_mla_with_kvcache_sm120. Deleted the old flash_mla_sm120_fallback.py.

AliceChenyy force-pushed the sm120-dsv4-rebase branch 3 times, most recently from 8779574 to a4b24dd, on May 9, 2026 04:20
Contributor

Does this condition also need an _is_sm120() check?

Collaborator

I think is_cuda covers SM120

Contributor

I mean adding `and not _is_sm120`.

Author

Done — added and not _is_sm120 guards at all DeepGEMM/tilelang paths in nsa_backend.py.

# ── Graph-safe routing: flatten topk assignments ──
# token_ids[slot] = which row of A (original token index)
# expert_ids[slot] = which expert's weights to use
flat_expert_ids = topk_ids.reshape(-1).contiguous() # [M*topk]
Contributor

topk_ids can be -1 for padded/filtered tokens, but this path passes them directly as Triton expert ids. Could we use safe_expert_ids = topk_ids.clamp_min(0) for loads and set the invalid slots' output to zero? The PyTorch fallback already skips eid_val < 0.

Author

Handled — see lines 361-363: flat_expert_ids_raw.clamp(min=0) for safe indexing, plus line 441+: invalid slots are zeroed out after the kernel. The Triton kernel itself runs on clamped-to-0 expert IDs (safe for loads), and the output is masked to zero post-kernel.
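
The clamp-then-mask pattern described in this reply can be sketched in plain Python as follows. This is an illustration only: the function name, the list-based data, and the per-expert scale standing in for the expert GEMM are all assumptions, and the real code operates on tensors.

```python
from typing import List


def combine_with_padding(
    topk_ids: List[int],
    slot_inputs: List[float],
    expert_scales: List[float],
) -> List[float]:
    """Apply a per-expert scale to each slot; padded slots (id < 0) yield 0.0."""
    # Clamp ids so every lookup is in range, even for padded slots.
    safe_ids = [max(eid, 0) for eid in topk_ids]
    # The "kernel" runs on every slot using the clamped ids (no branching).
    raw = [x * expert_scales[eid] for x, eid in zip(slot_inputs, safe_ids)]
    # Zero out the slots whose original id was invalid.
    return [y if eid >= 0 else 0.0 for y, eid in zip(raw, topk_ids)]


print(combine_with_padding([1, -1, 0], [1.0, 5.0, 2.0], [10.0, 100.0]))
# prints [100.0, 0.0, 20.0]
```

Running the computation on every slot and masking afterwards keeps the work shape static, which is the property CUDA graph capture needs.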

Contributor

For SM120, should this CP ragged path explicitly raise NotImplementedError instead of falling through to deep_gemm.fp8_mqa_logits, since DeepGEMM is unsupported here?

Author

Good catch. Will add if _is_sm120: raise NotImplementedError("CP ragged indexer not supported on SM120")

Comment thread python/sglang/jit_kernel/utils.py Outdated
return False
return get_jit_cuda_arch().major >= 9
arch = get_jit_cuda_arch()
# PDL requires SM100+ datacenter (tcgen05/TMEM); SM120 (desktop Blackwell) lacks these
Collaborator

There is currently a minor bug with old CUTLASS versions when the kernel is a CUTLASS kernel. But can you elaborate on why PDL does not work?

Author

PDL is now re-enabled on SM120.

@AliceChenyy AliceChenyy requested a review from YAMY1234 as a code owner May 20, 2026 02:53
AliceChenyy added a commit to AliceChenyy/sglang that referenced this pull request May 20, 2026
Address all reviewer feedback from PR sgl-project#24692:
- Use is_sm120_supported() helper instead of raw sm_version checks
- Guard SGLANG_OPT_DEEPGEMM_HC_PRENORM and SGLANG_OPT_USE_TILELANG_MHC_PRE
  with `not is_sm120_supported()` in deepseek_v4.py
- Auto-select marlin MoE backend on SM120 in deepseek_v4_hook.py
- Minor cleanups in indexer, metadata, nsa_backend, mxfp4_marlin_moe

Fix FlashMLA Triton kernel garbled output on latest sglang:dev image:
- Root cause: upstream changed KV cache dtype from float8_e4m3fn to uint8.
  The Triton kernel's as_strided() preserved the input dtype, so tl.load
  interpreted FP8 bit patterns as raw integers, corrupting attention scores.
- Fix: explicitly view through uint8 → float8_e4m3fn before passing to Triton.

Verified on sglang:dev-cu13 (sgl-kernel 0.4.2.post1, PyTorch 2.11+cu130):
- GSM8K 5-shot 200q: 99.0%
- Decode BS=1: 11.40 tok/s, TPOT 87.7ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AliceChenyy and others added 8 commits May 19, 2026 20:55
Adds full SM120 (RTX PRO 6000 / RTX 5090 / DGX Spark) support for
DeepSeek-V4 on SGLang, rebased onto main branch.

Key changes:
- Triton MXFP4 MoE kernel for SM120 (no MARLIN/tcgen05 on desktop Blackwell)
- Triton FlashMLA sparse decode kernel for SM120
- MQA wq-precompute with vectorized batch for CUDA graph compatibility
- DeepGEMM/PDL guards for SM120 (no TMEM/tcgen05)
- NSA backend SM120 dispatch (tilelang default, skip DeepGEMM metadata)
- FlashMLA SM120 adapter for deepseek_v4_backend
- 3 CUDA-graph-breaking paths fixed (MoE .unique/.item, NSA/Compressed MQA)

Results (8x RTX PRO 6000, TP=8):
- Decode: 10.26 tok/s BS=1 with CUDA graph (2.4x vs without)
- GSM8K 5-shot: 98.0% accuracy (200 questions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip DeepGEMM transform_sf_into_required_layout (tcgen05 unsupported)
and topk_v2 (128KB SMEM exceeds SM120 99KB limit) on SM120.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rted() directly

Address b8zhong review: do not define SM120 detection in this file,
call the existing util is_sm120_supported() at each usage site instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
is_sm120_supported() already checks is_cuda() internally and is
lru_cached, so the redundant `_is_cuda and` prefix is unnecessary.

- metadata.py: remove _is_cuda/_is_sm120 module vars, call util directly
- flash_mla_sm120_fallback.py: remove _is_cuda, simplify _is_sm120
- nsa_backend.py: remove redundant is_cuda() prefix
- nsa_indexer.py: remove redundant _is_cuda prefix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…orted()

The previous commit removed _is_sm120 from metadata.py but indexer.py
still imported it, causing DeepseekV4ForCausalLM to fail to register
and cascading CI failures across all DSv4 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AliceChenyy and others added 4 commits May 19, 2026 22:17
…quant)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts from NSA→DSA rename (sgl-project#25821):
- Migrate SM120 guards from nsa_indexer.py → dsa_indexer.py
- Migrate SM120 guards from nsa_backend.py → dsa_backend.py
- Move sm120_mqa_fallback.py, sm120_mqa_triton.py to dsa/ directory
- Update test import path to dsa.sm120_mqa_fallback
- Use is_sm120_supported() directly (no module-level _is_sm120 var)
…ctor

The NSA→DSA rename (PR sgl-project#25821) changed all backend attributes from
nsa_prefill_backend/nsa_decode_backend to dsa_prefill_backend/dsa_decode_backend,
but the SM120 branches in _set_default_dsa_backends() still used the old names,
silently creating orphan attributes while leaving dsa_* as None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
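
The failure mode in the commit above (assigning to a stale attribute name and silently creating an orphan) can be reproduced with a small sketch. Both classes are hypothetical stand-ins for the server arguments object. The `__slots__` variant is just one generic way to surface such mistakes early; the fix in this PR was to use the renamed attributes.

```python
class LooseArgs:
    def __init__(self) -> None:
        self.dsa_prefill_backend = None
        self.dsa_decode_backend = None


loose = LooseArgs()
loose.nsa_decode_backend = "tilelang"  # old name: silently creates a new attribute
# loose.dsa_decode_backend is still None at this point.


class StrictArgs:
    # Only these attribute names may ever be assigned.
    __slots__ = ("dsa_prefill_backend", "dsa_decode_backend")

    def __init__(self) -> None:
        self.dsa_prefill_backend = None
        self.dsa_decode_backend = None


strict = StrictArgs()
try:
    strict.nsa_decode_backend = "tilelang"
    stale_name_rejected = False
except AttributeError:
    stale_name_rejected = True  # the typo fails loudly instead of being ignored
```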
Add CI registration wrapper so test_sm120_mqa_fallback runs on every PR.
The tests are pure PyTorch and work on any CUDA GPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
samuellees (Contributor) commented May 20, 2026

/tag-and-rerun-ci ++

@Fridge003 (Collaborator)

max_seq_len: int,
clean_logits: bool = True,
) -> torch.Tensor:
"""CUDA-graph-compatible FP8 paged MQA logits (vectorized, no .item()).
Collaborator

Don't modify this function in place. Please create a new function for fp8_paged_mqa on SM120 and put your implementation there.

Author

Done. Restored the original fp8_paged_mqa_logits_torch and created a separate fp8_paged_mqa_logits_torch_sm120 with the vectorized (no .item()) implementation. The dispatch in C4IndexerBackendMixin selects the SM120 variant when is_sm120_supported().

# SM120: auto-select marlin MoE backend (dispatches to SM120 Triton kernel)
from sglang.srt.utils.common import is_sm120_supported

if is_sm120_supported() and server_args.moe_runner_backend == "auto":
Collaborator

It has already been set in server_args.py

Author

Removed. The SM120 marlin auto-select is now only in server_args.py.

Comment thread python/sglang/srt/server_args.py Outdated
self.dsa_decode_backend = "tilelang"
elif kv_cache_dtype == "fp8_e4m3":
if major >= 10:
if is_sm120_supported():
Collaborator

dsa_prefill_backend and dsa_decode_backend are for GLM-5 and DeepSeek V3.2. Remove their settings.

Author

Removed both SM120 branches from _set_default_dsa_backends().

@@ -0,0 +1,281 @@
"""
Collaborator

Move these tests to the main function of sm120_mqa_fallback.py and sm120_mqa_triton.py. Don't register them as CI tests.

Author

Removed the test file and CI registration. The source files (sm120_mqa_fallback.py, sm120_mqa_triton.py) under dsa/ were also removed per comment #6 (dead code for V4), so there is no target to move tests into.

Comment thread python/sglang/srt/models/deepseek_v4.py Outdated
return y, post, comb, False

- if envs.SGLANG_OPT_USE_TILELANG_MHC_PRE.get():
+ if envs.SGLANG_OPT_USE_TILELANG_MHC_PRE.get() and not is_sm120_supported():
Collaborator

Instead of adding these is_sm120_supported() checks to the if conditions, can we automatically set these environment variables to false when we detect SM120?

Author

Done. SM120 env auto-set is now in server_args.py under the DeepseekV4ForCausalLM block:

envs.SGLANG_OPT_FP8_WO_A_GEMM.set(False)
envs.SGLANG_OPT_USE_TOPK_V2.set(False)
envs.SGLANG_OPT_USE_TILELANG_MHC_PRE.set(False)
envs.SGLANG_OPT_DEEPGEMM_HC_PRENORM.set(False)
envs.SGLANG_FP8_PAGED_MQA_LOGITS_TORCH.set(True)

Removed all is_sm120_supported() guards from deepseek_v4.py. Also changed post_load_weights to use envs.SGLANG_OPT_FP8_WO_A_GEMM.get() instead of the module-level cached _FP8_WO_A_GEMM, since the env is set after module import.
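
The reason for switching from the module-level cached value to a live `.get()` can be shown with a small sketch. `EnvFlag` is a hypothetical stand-in for an env-backed setting with `get()`/`set()`, and the default of True is an assumption made only for this illustration.

```python
class EnvFlag:
    """Tiny stand-in for an env-backed setting (hypothetical, not SGLang's class)."""

    def __init__(self, default: bool) -> None:
        self._value = default

    def get(self) -> bool:
        return self._value

    def set(self, value: bool) -> None:
        self._value = value


FP8_WO_A_GEMM = EnvFlag(default=True)

# Snapshot taken at "import time", before hardware detection has run.
_CACHED_FP8_WO_A_GEMM = FP8_WO_A_GEMM.get()

# Later, startup code detects SM120 and turns the feature off.
FP8_WO_A_GEMM.set(False)


def post_load_weights_cached() -> bool:
    return _CACHED_FP8_WO_A_GEMM  # stale: still reflects the import-time value


def post_load_weights_live() -> bool:
    return FP8_WO_A_GEMM.get()  # current: sees the value set at startup
```

Here `post_load_weights_cached()` still returns True while `post_load_weights_live()` returns False, which is why a setting changed after import has to be read at the point of use.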

from sglang.srt.layers.dp_attention import get_attention_tp_size
from sglang.srt.model_executor.forward_batch_info import ForwardBatch, ForwardMode
from sglang.srt.utils import is_cuda, is_hip
from sglang.srt.utils.common import is_sm120_supported
Collaborator

This file is unrelated to DeepSeek V4. Don't modify it.

Author

Reverted all changes.

if (_is_cuda and sm_version < 90) or (_is_musa and sm_version < 31):
return False
# DeepGEMM requires TMEM/tcgen05 (SM100+datacenter), not available on SM120
if is_sm120_supported():
Collaborator

Use sm_version == 120 here instead

Author

Changed to if sm_version == 120: return False.
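
A simplified sketch of the resulting gate, based only on the lines quoted in this thread. The function name and the final `return True` are assumptions, and the MUSA branch is omitted.

```python
def deep_gemm_allowed(sm_version: int, is_cuda: bool = True) -> bool:
    """Simplified capability gate (hypothetical name; CUDA branch only)."""
    if is_cuda and sm_version < 90:
        return False
    # DeepGEMM needs TMEM/tcgen05, which SM120 (desktop Blackwell) lacks.
    if sm_version == 120:
        return False
    return True


for sm in (80, 90, 100, 120):
    print(sm, deep_gemm_allowed(sm))
```

Note that a plain `sm_version >= 100` test would wrongly admit SM120, so the exclusion has to be an explicit equality check.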

)[0]
)

o = flash_mla_with_kvcache_entrypoint(**input_dict, backend="kernel")[0]
Collaborator

Don't modify it this way. Please use the newly created entrypoint only on SM120.

Author

Restructured: SM120 branch calls flash_mla_with_kvcache_sm120 from the new standalone file; the original flash_mla.flash_mla_with_kvcache(...) call is preserved unchanged in the else branch.
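
The shape of that dispatch can be sketched as below. The two stub functions and the `run_attention` wrapper are hypothetical placeholders; only the branch structure (SM120 path versus untouched default path) reflects what the reply describes.

```python
from typing import Any, Dict, Tuple


def sm120_kernel_stub(**kwargs: Any) -> Tuple[str, None]:
    """Placeholder for the SM120-specific entrypoint."""
    return ("sm120", None)


def default_kernel_stub(**kwargs: Any) -> Tuple[str, None]:
    """Placeholder for the original, unmodified call."""
    return ("default", None)


def run_attention(input_dict: Dict[str, Any], is_sm120: bool) -> str:
    if is_sm120:
        # SM120-only branch: new entrypoint from the standalone file.
        out = sm120_kernel_stub(**input_dict)[0]
    else:
        # Every other architecture keeps the original call unchanged.
        out = default_kernel_stub(**input_dict)[0]
    return out
```

Keeping the original call verbatim in the else branch means non-SM120 behavior cannot change as a side effect of this PR.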

@@ -0,0 +1,185 @@
"""PyTorch fallback for MXFP4 MoE GEMM on SM120.
Collaborator

Remove it if it's not used anywhere

Author

Confirmed no imports. Deleted.

AliceChenyy and others added 2 commits May 21, 2026 04:13
…move dead code

- Restore original fp8_paged_mqa_logits_torch, add separate SM120 variant
- Move flash_mla SM120 impl to standalone flash_mla_sm120.py
- SM120-only dispatch in deepseek_v4_backend.py, original path untouched
- Auto-set SM120-incompatible envs in server_args.py (FP8_WO_A_GEMM,
  TOPK_V2, TILELANG_MHC_PRE, DEEPGEMM_HC_PRENORM, FP8_PAGED_MQA_LOGITS_TORCH)
- Remove is_sm120_supported() guards from deepseek_v4.py and metadata.py
- Use sm_version == 120 in configurer.py
- Revert DSA files (dsa_indexer.py, dsa_backend.py) — not used by V4
- Delete dead code: sm120_mqa_{fallback,triton}.py, mxfp4_moe_fallback.py,
  test_sm120_mqa_fallback.py, CI registration
- Remove redundant SM120 marlin block from deepseek_v4_hook.py
- Remove SM120 branches from _set_default_dsa_backends (GLM-5/V3.2 only)

Verified: GSM8K 10/10 on 8xRTX PRO 6000 (SM120), no manual env vars needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SM120 device to hardware table and deployment notes:
- RTX PRO 6000 (8 GPUs) for V4-Flash
- Docker image: lmsysorg/sglang:dev-cu13
- Launch command with auto-detected SM120 defaults
- Performance and accuracy notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions (Bot) added the documentation label May 21, 2026
@AliceChenyy (Author)

Done. Added SM120 (RTX PRO 6000) section to the DeepSeek-V4 cookbook — covers TP=8 (recommended) and TP=4 (memory-constrained) configurations, Docker image, and performance notes. See commit 21c00bc.

Comment thread python/sglang/jit_kernel/utils.py Outdated
if is_hip_runtime():
return False
return get_jit_cuda_arch().major >= 9
arch = get_jit_cuda_arch()
Collaborator

Please revert this unnecessary change.

Author

Done — reverted in cd44380.

…tils.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Labels

deepseek, documentation, jit-kernel, run-ci
7 participants